notes - cont. measure & intro prob

table of contents

  1. intro
  2. measure for probability
  3. measure and norm
  4. dipping into functional analysis

intro

This post can be seen as a continuation of my recent post notes - probability & type, which was long but ended somewhat prematurely: I felt the type-theoretic aspect was a distraction, even though it motivated my initial look at measure theory in the context of intro probability.

This is not grift; it is merely convenient for my personal site to be hypertextual: the earlier post's connection between "measure" and "probability" ended much the way my post connecting differential equations and norms did (both thanks to the notorious 2011 Qiaochu Yuan, now better known as the ex-mathematician "QC").

measure for probability

Now, what exactly did QC say?

You should be comfortable with real analysis on the level of Rudin's Principles of Mathematical Analysis. Don't skimp on this; it's as much a maturity prerequisite as a prerequisite for actual concepts and techniques. It might also help to study a little point-set topology, just so you're used to the idea of considering a collection of subsets of a set satisfying certain axioms (QC, 2011).

I think chapters 1-7 of Rudin's Principles of Mathematical Analysis furnish sufficient preparation for measure theory. However, the notions of "pointwise convergence" and "uniform convergence" in chapter 7 of this publication are essential prerequisites that are often neglected by students intending to study measure theory (Amitesh Datta, 2011).

Source: Pre-requisites to study measure theory?

Now, since analysis is one of the most rigorous subjects in math, and Rudin is one of the most rigorous treatments of analysis, I felt out of my depth.

Then, while reading about the norm for the differential equations post, I stumbled upon a nicer thread.

The standard answer is that measure theory is a more natural framework to work in. After all, in probability theory you are concerned with assigning probabilities to events (sets)... so you are dealing with functions whose inputs are sets and whose outputs are real numbers. This leads to sigma-algebras and measure theory if you want to do rigorous analysis.

But for the more practically-minded, here are two examples where I find measure theory to be more natural than elementary probability theory:

  1. Suppose \(X \sim \text{Uniform}(0,1)\) and \(Y = \cos(X)\). What does the joint density of \((X,Y)\) look like? What is the probability that \((X,Y)\) lies in some set \(A\)? This *can* be handled with delta functions but personally I find measure theory to be more natural.

  2. Suppose you want to talk about choosing a random continuous function (element of \(C(0,1)\) say). To define how you make this random choice, you would like to give a p.d.f., but what would that look like? (The technical issue here is that this space of continuous functions is infinite-dimensional and so Lebesgue measure cannot be defined). This problem is very natural in the field of Stochastic Processes including Financial Mathematics -- a stock price can be thought of as a random function. Under the measure theory framework, you talk in terms of probability measures instead of p.d.f.'s, and so infinite dimensions do not pose an obstacle.

(Chris Evans, 2013)

Source: why measure theory for probability?
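To make Evans's first example concrete, here is a minimal Monte Carlo sketch (mine, not from the thread): since \(Y = \cos(X)\) is a deterministic function of \(X\), the pair \((X,Y)\) lives on a curve of zero area, so no joint density with respect to two-dimensional Lebesgue measure exists, yet the probability of landing in a set \(A\) is easy to estimate by working with the measure directly. The rectangle \(A\) below is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X ~ Uniform(0, 1); Y = cos(X) is a deterministic function of X, so the
# pair (X, Y) is concentrated on the curve {(x, cos x) : 0 < x < 1},
# which has zero area: no joint density w.r.t. 2-D Lebesgue measure.
x = rng.uniform(0.0, 1.0, size=n)
y = np.cos(x)

# P((X, Y) in A) for the (arbitrary) rectangle A = [0, 0.5] x [0.9, 1.0],
# estimated as an empirical frequency, i.e., directly from the measure.
in_A = (x <= 0.5) & (0.9 <= y) & (y <= 1.0)
print(in_A.mean())  # ~ 0.451, since cos(x) >= 0.9 iff x <= arccos(0.9)
```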

There are two other helpful explanations from this post. The most informative excerpt from the first of these basically says that the motivation behind learning measure theory in probability and statistics is to provide a way to measure the probability of infinite sets of events. This is important when dealing with continuous random variables and determining the probability that a variable takes on a specific value or falls within a certain range. Measure theory allows for the definition of expectations and probability density functions (pdfs) for continuous random variables. It provides a framework for assigning probabilities to sets of events where individual events have zero probability.

This motivation about pathological sets makes sense if you are coming from an analysis background, because so much effort is put into defining number systems, and functions on those number systems. So the Cantor set is an important set within the context of analysis. However, I have never seen anyone try to assign probability to the Cantor set in any practical application. I am not saying that no one has done this; more likely the problem was solved long ago, though I have not checked.

I think the more sensible motivation for why to learn measure theory as a probabilist or statistician has to do with finding a way to measure the probability of infinite sets of events. In probability and statistics we need a way to identify the probability that a continuous random variable takes a value less than or equal to a specific value, i.e., the cumulative distribution function

\[ F_X(x_0) := \Pr(X \leq x_0), \]

where \(x_0\) is some particular value within the support of \(X\). For example, say we want \(F_X(0.5) = \Pr(X \leq 0.5)\). Given that \(X\) is continuous, the set of all possible values of \(X\)--also known as the support of \(X\)--is the set \(\{x : x \in [0,1], x \in \mathbb{R} \}\). The set of all possible values in the event of interest--namely \(X \leq 0.5\)--is the set \(\{x : x \in [0,0.5], x \in \mathbb{R}\}\).

Both the support of \(X\) and the event of interest are infinite sets. Intuitively, the size of the support should be larger than the size of the event of interest--though lacking a theory of measure we have no way to make that notion precise. Think of trying to compute the frequency or probability of this event directly: you would be dividing \(\infty\) by \(\infty\).

So measure gives us a way to assign probability to sets of events where each individual event has zero probability. Another way of saying this is that measure theory gives us a way to define expectations and pdfs for continuous random variables.

(Krishnab, 2018)
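A tiny sketch of Krishnab's point (my own illustration, not from the thread): for \(X\) uniform on \([0,1]\), the probability of the event \(\{X \leq 0.5\}\) is the Lebesgue measure (length) of the interval \([0, 0.5]\), even though every individual outcome has probability zero.

```python
import numpy as np
from scipy import stats

# P(X <= 0.5) for X ~ Uniform(0, 1) is the length of [0, 0.5]:
print(stats.uniform(0, 1).cdf(0.5))   # 0.5 exactly, the CDF value

# every single point has probability zero, yet frequencies of the
# *set* converge to its measure:
rng = np.random.default_rng(0)
x = rng.uniform(size=1_000_000)
print((x <= 0.5).mean())              # ~ 0.5
print((x == 0.25).mean())             # 0.0: individual values are never hit
```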

While that answer covered a lot of territory, a coarser sweep helps nail down that measure theory provides a unified mathematical framework for handling both discrete and continuous events in probability theory. While discrete events can be described using probability mass functions, continuous events require probability density functions. Measure theory allows for a consistent treatment of these concepts and eliminates the need to distinguish between different types of events. It provides a deeper understanding of probability theory and extends to more general objects in stochastic theory.

Simple answer: Tossing a coin.

Longer answer: You know that you treat discrete events like the above with probability mass functions or similar, but continuous things with probability density functions. Imagine you had \(X\) which is randomly uniform on \([0,1]\) half the time and \(5\) the rest of the time. Perfectly reasonable thing, could easily come up. Doesn't fit into either framework.

Measure theory provides a consistent language and mathematical framework unifying these ideas, and indeed much more general objects in stochastic theory. It removes any necessity to distinguish between fundamentally similar objects and crystallizes the relevant points out, allowing much deeper understanding of the theory.

(Carl "not all wrong" Turner)

measure and norm

Recall from my connection between differential equations and norms:

To begin, a well-phrased question from Math Stack Exchange:

So if \(X\) is a vector space, and you define a norm, \(x \mapsto \| x \|\), on it, then the bounded subset, \(V = \{ x \in X: \|x\| < \infty \}\), is automatically a subspace.

This follows from the definition of a norm, so for all \(x,y \in V\), \(c \in \mathbb{C}\), \(\|x + c y\| \le \|x\| + \|cy\| = \|x\|+ |c| \|y\| < \infty\), so \(x+cy \in V\).

Is this the reasoning behind the definition of a norm?

Now, the top answer to this MSE question is from QC in 2011. While I love the guy, it is basically just links to some functional analysis resources. So, no.

There is a better answer from another MSE question:

Norms are inspired from the Euclidean distance function and refer to a generalized class of metrics \(d\) which for a normed linear space \(V\), satisfy the properties:

  • \(d(a,b) = d(a-b,0) = d(0,b-a) \quad \forall \, a,b \in V\)
  • \(d(\lambda u,0) = |\lambda| d(u,0) \quad \forall \, u \in V\)
  • \(d(a,b) \le d(a,c)+d(c,b)\)
  • \(d(a,b) \ge 0 \quad \) with equality \( \Leftrightarrow a=b.\)


Again, as I've said in that post and this post: functional analysis offers the best perspective for appreciating norms.
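To make the four metric properties quoted above concrete, here is a quick numerical spot-check (my own sketch) of the metric induced by the Euclidean norm, \(d(a,b) = \|a-b\|\):

```python
import numpy as np

rng = np.random.default_rng(0)

def d(a, b):
    # metric induced by the Euclidean norm: d(a, b) = ||a - b||
    return np.linalg.norm(np.asarray(a) - np.asarray(b))

a, b, c = rng.normal(size=(3, 4))  # three random vectors in R^4
lam = -2.5

print(np.isclose(d(a, b), d(a - b, 0)), np.isclose(d(a, b), d(0, b - a)))  # translation invariance
print(np.isclose(d(lam * a, 0), abs(lam) * d(a, 0)))                       # homogeneity
print(d(a, b) <= d(a, c) + d(c, b))                                        # triangle inequality
print(d(a, b) >= 0, np.isclose(d(a, a), 0))                                # non-negative, zero iff a = b
```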

However, my favorite notes on real analysis, from the lovely Prof. J. K. Hunter, make use of norms but, concerning measures, state:

The Lebesgue integral allows for the integration of unbounded or highly discontinuous functions that may not have Riemann integrals. It offers improved mathematical properties compared to the Riemann integral. The definition of the Lebesgue integral involves measure theory, which is beyond the scope of this discussion. However, it is worth noting that the Riemann integral is sufficient for many purposes. Even if the Lebesgue integral is needed, it is advisable to first understand the Riemann integral.

The question remains: what is the difference between measure and norm, or at least, why might they be related?

The basic idea of the first answer is this. Norm and measure are used in different contexts. A norm lives on a vector space and assigns non-negative numbers to individual vectors, while a measure lives on a collection of subsets of a set and assigns non-negative numbers to those subsets: norms operate on elements, measures on sets. The Euclidean norm, the square root of the sum of squared entries, is the basic example of the former; lengths, areas, and volumes, extended to unions and intersections of sets, are the basic examples of the latter. Norms are denoted \(\|\cdot\|\), measures \(\mu\). Unlike a norm, where a non-zero vector always has a non-zero norm, the axioms of a measure allow non-empty sets to have measure zero. (Pseudo-norms may assign zero to non-zero vectors, but they are not typically classified as norms.)

Norm is mainly used in the context of vector spaces, i.e., a set equipped with a structure that enables us to sum up elements of the set and multiply them with scalars (e.g., real numbers).

Measure is mostly used in the context of sets which are not required to have any additional structure. Yet, to define the notion of measure, we need to require certain rules (about the open subsets of the given set and their unions and intersections) to hold.

Basic example of a norm would be the Euclidean norm (square all the entries of a given vector, sum them up and take the square root - classic). Basic example of a measure would be the length of an interval, area of a two-dimensional region, or a volume of some higher-dimensional object. But the objects we want to measure are not just these elementary shapes, but also their unions and intersections (so things may get quite complicated as far as the general element to be measured is concerned - unlike with the case of vector spaces where the element may be uniquely represented with a sequence of numbers (components of the vector)).

Let our set be denoted by \(S\). Both the norm and the measure are certain functions on a set under consideration. A bit more precisely, norm (denoted \(\|\cdot\|\)) takes only one argument from \(S\) and spits out a non-negative number, i.e., it is a mapping \(\|\cdot\| \colon S \rightarrow \mathbb{R}\) satisfying certain rules such as the norm of a zero vector to be zero. On the other hand, a measure takes a subset of \(S\) and spits out a non-negative number. So \(\mu \colon \Omega(S) \rightarrow \mathbb{R}\) is a function considered on a certain set \(\Omega(S)\) consisting of subsets of \(S\), and we require some axioms about \(\Omega(S)\) to hold (so actually some structure is hidden somewhere else anyway, yet different structure than the vector spaces have) in order for the measure to be definable nicely. Moreover, other axioms must be satisfied by \(\mu\) to be a function called measure (e.g., that the empty set has zero measure).

Note: The axioms of a measure allow a non-empty set to have zero measure - unlike the case of a norm, where a non-zero vector always has a non-zero norm... well, actually we can consider a type of norm which assigns zero to a non-zero vector, but those are usually called pseudo-norms.

(Radek Suchánek, 2017)

Source: Difference between a measure and a norm in a euclidean space
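A minimal sketch (mine) of the type distinction Suchánek draws: a norm eats a single element and returns a number, while a measure eats a set. Here a union of disjoint intervals stands in for the set, and the `lebesgue` helper is a hypothetical name for total length:

```python
import numpy as np

# norm: one element of the space in, one non-negative number out
v = np.array([3.0, 4.0])
print(np.linalg.norm(v))          # 5.0

# measure: a *set* in, one non-negative number out.  Represent a union
# of disjoint intervals by endpoint pairs and use Lebesgue measure
# (total length); additivity over disjoint pieces is built in.
def lebesgue(intervals):
    return sum(b - a for a, b in intervals)

print(lebesgue([(0.0, 0.5), (1.0, 1.25)]))  # 0.75
print(lebesgue([]))                         # 0.0: the empty set has measure zero
```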

Next, from the same source, is a more definition-focused explanation. A norm is a function that assigns a non-negative number to a vector and behaves like a length function; it satisfies non-negativity, the triangle inequality, and homogeneity, with the Euclidean norm in \(\mathbb{R}^2\) as the standard example. A measure, on the other hand, assigns non-negative numbers to sets; it satisfies non-negativity, assigns measure zero to the empty set, and is countably additive over disjoint sets, with the Lebesgue measure as the standard example. How an object is measured then depends on how you view it: the length interpretation applies when a line segment is considered as a vector, whereas the same segment considered as a subset of \(\mathbb{R}^2\) has (two-dimensional) measure zero.

I think that if we see the definitions of norm and measure, we can make the situation clearer.

A norm is usually defined in a vector space and is a function that takes a vector and outputs a non-negative number. It is also required to have certain properties that make it behave like a length function. These axioms are:

  • \(\|x\| > 0\) if \(x \ne 0\) and \(\|x\| = 0\) if \(x = 0\) (as elements of the vector space you have)
  • \(\|x + y\| \le \|x\| + \|y\|\) for all vectors (this is the analogue of the triangle inequality)
  • \(\|\alpha x\| = |\alpha|\cdot\|x\|\) for all vectors \(x\) and scalars \(\alpha\)

The most usual example is the Euclidean norm in \(\mathbb{R}^2\) which, given a vector \(u=(x,y)\), gives \(\|u\| = \sqrt{x^2 + y^2}\).

Now, a measure is a function that takes sets (which, in a familiar setting like \(\mathbb{R}^2\), can be thought of as shapes) and outputs non-negative numbers. The corresponding axioms are:

  • \(\mu(A) \ge 0\) for all sets \(A\)
  • The empty set has measure zero
  • If we have a family of pairwise disjoint sets \(A_i\), then \(\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)\)

The most usual measure is the Lebesgue Measure, which, in good sets, coincides with our sense of length in \(\mathbb{R}^1\), area in \(\mathbb{R}^2\), and volume in \(\mathbb{R}^3\).

Now, for your question: The answer depends on how you view your objects. If you want to see the line as a vector (with, say, its start at the origin), then the norm would be the length of the line.

Now it gets trickier because the standard measure in \(\mathbb{R}^2\) is area, so it is expected (and that's what happens) that every line, whether a segment or infinite, has measure (area) 0. But if you identify the line with \(\mathbb{R}^1\), then you would have as measure what we usually understand as length.

To make this clear, let's say you have the line going from \((0,0)\) to \((1,1)\). This can be represented as:

  • the vector: \(u=(1,1)\)
  • the subset of \(\mathbb{R}^2\): \(L = \{ (x,y) \in \mathbb{R}^2 \ | \ y = x \ \land \ 0 \le x \le 1 \}\)

Then the norm \(\|u\| = \sqrt{2}\), and by identifying \(L\) with the set \(L' = [0,\sqrt{2}]\), you have \(\lambda^1(L') = \sqrt{2}\), with \(\lambda^1\) being the Lebesgue measure on \(\mathbb{R}^1\). (Note that, as we said above, the Lebesgue measure on \(\mathbb{R}^2\) would give the area of the line, so \(\lambda^2(L) = 0\).)
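A quick numerical companion (my own sketch) to this worked example: the norm of \(u\), the length of \(L'\), and a Monte Carlo hint that the two-dimensional Lebesgue measure of \(L\) is zero.

```python
import numpy as np

# the segment from (0,0) to (1,1), viewed two ways
u = np.array([1.0, 1.0])
print(np.linalg.norm(u))        # norm of the vector u: sqrt(2) ~ 1.4142

# identifying L with L' = [0, sqrt(2)] in R^1, its Lebesgue measure is
# just the interval's length:
print(np.sqrt(2) - 0.0)         # lambda^1(L') = sqrt(2)

# 2-D Lebesgue measure of L: uniform points in [0,1]^2 land exactly on
# the diagonal with probability 0, so the estimated area is 0.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(1_000_000, 2))
print((pts[:, 0] == pts[:, 1]).mean())  # ~ 0.0: lambda^2(L) = 0
```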

dipping into functional analysis

To broach this topic, I begin with another thread. The question considers a measure space \((S, \Sigma, \mu)\) and the normed vector space \(\mathcal{L}^2(\mu)\), which consists of functions with finite square integrals. (Here integrability is judged by the square of the function rather than its absolute value: for \(f^2\) to be Lebesgue integrable, the integrals of the positive and negative parts of the real component, and likewise of the imaginary component, must all be finite.) For a measurable function \(f \in \mathcal{L}^2(\mu)\), the norm is defined as the square root of the integral of \(f^2\) with respect to \(\mu\), resembling a "length" in the same way as the Euclidean norm on \(\mathbb{R}^n\).

Moving on to a probability space \((\Omega, \mathcal{F}, P)\) and the normed vector space \(\mathcal{L}^2(P)\), the norm of a random variable \(X\) is the square root of the expectation of \(X^2\). However, this is not the "length" of \(X\) unless the expected value of \(X\) is zero; in practice, the standard deviation, obtained by first subtracting the expected value \(\mu\) from \(X\), plays the role of "length". This raises two questions: why is subtracting \(\mu\) necessary for the "length" interpretation, and what does \(X - \mu\) even mean when \(\mathcal{L}^2(P)\) is viewed as a vector space over \(\mathbb{R}\)? An update to the question adds the desire to carry Euclidean intuition over to visualize the correlation coefficient \(\rho\) as the cosine of the angle between random variables \(X\) and \(Y\), and the obstacle that the norms and inner products of \(\mathcal{L}^2(P)\) do not line up with that picture without centering.

The question posed from "Lengths" of Random Variables in Infinite Dimensional Spaces is as follows:

Consider a measure space \((S, \Sigma, \mu)\) and the normed vector space \(\mathcal{L}^2(\mu)\). Then for any measurable function \(f: S \to \mathbb{R}\) with \(f \in \mathcal{L}^2(\mu)\), the norm is defined as

\[ ||f||_{\mathcal{L}^2(\mu)} := \left(\int_S f^2(x) \, \mu(dx)\right)^{1/2}, \]

and, as I understand it, this is exactly analogous to the "length" of \(f\), just as the Euclidean norm is the "length" of a vector in \(\mathbb{R}^n\).

Now consider a probability space \((\Omega, \mathcal{F}, P)\) and the normed vector space \(\mathcal{L}^2(P)\). Let \(X: \Omega \to \mathbb{R}\) be a random variable. Then for any such \(X \in \mathcal{L}^2(P)\), we define the norm of \(X\)

\[ ||X||_{\mathcal{L}^2(P)} := \left( \int_{\Omega} X^2(\omega) \, P(d\omega)\right)^{1/2} = \left(E\left(X^2\right)\right)^{1/2}, \]

but this is *not* the "length" of \(X\) unless \(E(X) = 0\). Instead, we usually think of the standard deviation as the length of \(X\) by subtracting \(\mu := E(X)\) from \(X\) first:

\[ std(X) := \left(E\left(\left(X - \mu\right)^2\right)\right)^{1/2}. \]

This brings up two questions:

  1. Why must we subtract \(\mu\) to interpret this as the "length"?
  2. I'm thinking of \(\mathcal{L}^2(P)\) as a vector space over \(\mathbb{R}\). Since \(\mu \in \mathbb{R}\), what does \(X - \mu\) mean? In Euclidean space, it doesn't make sense to write \(\vec{x} - c\) for \(\vec{x} \in \mathbb{R}^n\) and \(c \in \mathbb{R}\).

Update: I want to be able to apply the intuition of Euclidean geometry to visualize things like the correlation coefficient \(\rho\) as the cosine of the angle between two random variables \(X\) and \(Y\), as explained here. In \(\mathbb{R}^n\), the cosine of the angle between two vectors \(\vec{x}\) and \(\vec{y}\) is related to their inner product and lengths by

\[ \cos \theta = \frac{\langle\vec{x}, \vec{y}\rangle}{||\vec{x}||\cdot||\vec{y}||}. \]

If \(||X||_{\mathcal{L}^2(P)}\) were indeed the "length" of \(X\), then I feel like I should just be able to change the norms and inner products, but I can't because

\[\cos \theta = \frac{\left<X,Y\right>_{\mathcal{L}^2(P)}}{||X||_{\mathcal{L}^2(P)}\cdot||Y||_{\mathcal{L}^2(P)}} \neq \frac{\left<X - \mu_X, Y - \mu_Y\right>_{\mathcal{L}^2(P)}}{||X - \mu_X||_{\mathcal{L}^2(P)}\cdot||Y - \mu_Y||_{\mathcal{L}^2(P)}} = \frac{Cov(X,Y)}{std(X)\cdot std(Y)} = \rho\]
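Before turning to the answers, a quick numerical sanity check (my own sketch, with an arbitrary pair of correlated variables standing in for elements of \(\mathcal{L}^2(P)\)): the cosine formula recovers \(\rho\) only after centering.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# two correlated random variables with nonzero means; the true
# correlation of this construction is 0.6
z1, z2 = rng.normal(size=(2, n))
x = 3.0 + z1
y = -1.0 + 0.6 * z1 + 0.8 * z2

def inner(a, b):                 # <A, B> = E[A B], estimated by a sample mean
    return np.mean(a * b)

def norm(a):                     # ||A||_{L^2(P)} = sqrt(E[A^2])
    return np.sqrt(inner(a, a))

# uncentered cosine: NOT the correlation, because E[X], E[Y] != 0
print(inner(x, y) / (norm(x) * norm(y)))

# centered cosine: Cov(X, Y) / (std(X) std(Y)) = rho ~ 0.6
xc, yc = x - x.mean(), y - y.mean()
print(inner(xc, yc) / (norm(xc) * norm(yc)))
print(np.corrcoef(x, y)[0, 1])   # numpy's estimate agrees
```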

Now, consider the first, rather dismissive, answer. The basic idea is that interpreting norms in function spaces as lengths can mislead and hinder understanding; it is more accurate to take a norm as offering a vague notion of size than as geometric intuition. Norms are measured against the zero vector by default because length in a vector space is translation invariant, and the reference function can be swapped out to capture the distance between two functions: in \(L^{2}(\mathbb{R})\), for instance, the norm can be taken with respect to another function to measure their separation. The standard deviation example simply follows from how that functional is defined on the normed space and says nothing about vector spaces as such. Finally, visualizing infinite-dimensional Hilbert spaces runs into our inability to visualize beyond three dimensions.

Length is not really the best way to think about norms in function spaces. It is much better to think of the norm as offering a vague notion of size. Thinking of the norm in function spaces as length just connotes too much geometric intuition where there is none. I think you are falling into a common trap when starting out in functional analysis, which is to try to interpret everything geometrically. I’ll just say that this is a very problematic way to approach the subject and will cause you more headaches than it will insight.

The reason that the norm is always taken against the zero vector by default is because in vector spaces, the length of vectors is translation invariant. For example, in \(L^{2}(\mathbb{R})\), we take the norm against the zero function by default \[\lvert\lvert f \rvert\rvert_{L^{2}}=\lvert\lvert f-0\rvert\rvert=\left(\int_{E \subseteq \mathbb{R}} \lvert f-0 \rvert^{2}\, d\mu \right)^{1/2}\] but we could just as easily pick another function, say \(g \in L^{2}(\mathbb{R})\) and write \[\lvert\lvert f-g \rvert\rvert_{L^{2}}=\left(\int_{E \subseteq \mathbb{R}} \lvert f -g \rvert^{2}\, d\mu \right)^{1/2}\] which tells us how far apart these functions are in some sense (and there are many senses which do not coincide with your geometric intuition). Your standard deviation example just follows from the fact that you have a linear functional \(\mathbb{E}[\cdot]:L^{2}(\Omega,\mathcal{F},\mathbb{P}) \to \mathbb{R}\) composed with another linear functional \(\sigma[\cdot]:L^{2}(\Omega,\mathcal{F},\mathbb{P}) \to \mathbb{R}\) given by \[\sigma[X]=(\mathbb{E}[(X-\mathbb{E}[X])^{2}])^{1/2}\] and this has nothing to do with how vector spaces work, it is just because of how \(\sigma[X]\) is defined. It is the "distance" (or how much "size" is between them) from the constant function whose value is given by the functional \(\mathbb{E}[X]\).

As for your second question, like I said before, our brains are limited to visualizing in 3 dimensions. Trying to "visualize" in an infinite dimensional Hilbert space simply isn’t going to work.
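A small sketch (mine, following the quoted answer's setup) of taking the \(L^2\) norm "against" a function other than zero; the functions \(\sin\) and \(\cos\) on \([0,1]\) are arbitrary stand-ins:

```python
import numpy as np
from scipy.integrate import quad

f, g = np.sin, np.cos  # two arbitrary elements of L^2 on [0, 1]

# ||f - 0||_2: the default norm, taken against the zero function
norm_f_sq, _ = quad(lambda t: f(t) ** 2, 0.0, 1.0)
print(np.sqrt(norm_f_sq))

# ||f - g||_2: the same formula taken against g -- how far apart the
# two functions are in the L^2 sense
dist_sq, _ = quad(lambda t: (f(t) - g(t)) ** 2, 0.0, 1.0)
print(np.sqrt(dist_sq))
```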

Lastly, this explanation discusses different vector spaces and their associated norms. Examples of vector spaces include \(\mathcal{R}^n\), the n-dimensional space of real numbers, and \(L^2(\mu)\). Norms are non-negative real-valued functions that satisfy specific properties. For instance, in \(L^2(\mu)\) the norm is \(\|f\|_2 = \left(\int f^2 d\mu\right)^{1/2}\), while in Euclidean space the norm of a vector \(x\) is \(\|x\| = \left(\sum_{i=1}^{n}x_i^2\right)^{1/2}\), the familiar Euclidean length. A normed space also carries an induced metric, which serves as a distance between elements. The variance of a random variable \(X\) is related to the norm through \(Var(X) = \|X - \mu\|_2^2\), where \(\mu\) is the expectation of \(X\); the standard deviation is the actual norm, and the covariance between two random variables \(X\) and \(Y\) is the inner product of their centered versions. Notably, that inner product reproduces the \(L_2(P)\) norm in the special case \(X = Y\) with \(E(X) = 0\), which is just \(Cov(X,X) = Var(X)\); in general, norms (those of \(L_p(P)\) spaces for \(p \neq 2\), say) cannot be obtained from an inner product.

Let \(V\) be a vector space. Examples of \(V\) are, \(V=\mathcal{R}^n\), the n-dimensional space of reals or \(V=L^2(\mu)=\{f:\int f^2 d\mu<\infty\}\).

A norm \(\vert\vert\cdot\vert\vert\) is a non-negative real-valued function on \(V\) which satisfies the three well-known properties.

For \(f\in L^2(\mu)\), a norm is \(\vert\vert f\vert\vert_2=(\int f^2 d\mu)^{1/2}\).

In a Euclidean space, for \(x=(x_1,\ldots,x_n) \in \mathcal{R}^n\) a norm is defined as \(\vert\vert x\vert\vert=(\sum^{n}_{i=1}x_i^2)^{1/2}\)

In a normed space there is an induced metric (a "distance" between its elements) defined by \(\vert\vert v-w\vert\vert,\;v,w\in V\).

For \(f,g\in L^2(\mu)\), \(\vert\vert f-g\vert\vert_2=(\int (f-g)^2 d\mu)^{1/2}\) and for \(x,y\in \mathcal{R}^n\) the induced metric is \(\vert\vert x-y\vert\vert=(\sum^{n}_{i=1}(x_i-y_i)^2)^{1/2}\) which is the well known Euclidean distance between two vectors.

So, the length of a vector is actually the Euclidean norm on \(\mathcal{R}^n\).

The variance is a special case. Let \(X\) be a real-valued random variable, i.e., a measurable function; let \(\mu\in \mathcal{R}\) be a constant, the expectation of \(X\); and let \(P\) be a probability measure. Then \[Var(X)=\vert\vert X-\mu\vert\vert_2^2=\int(X-\mu)^2dP=E(X-\mu)^2\]

The standard deviation is the actual norm.

The covariance between two random variables \(X,Y\) is the inner product \(\langle X-E(X),\,Y-E(Y)\rangle = \int (X-E(X))(Y-E(Y))\,dP\).

Notice that this inner product can produce the \(L_2(P)\) norm as a special case for \(X=Y,\;E(X)=0\). \[\langle X,X\rangle=\int X^2\, dP=\vert\vert X\vert\vert_2^2\] This is the known formula \(Cov(X,X)=Var(X)\).

In general, norms (for instance those of \(L_p(P)\) spaces with \(p \neq 2\)) cannot be produced from an inner product.

With all that in mind, even this brief dip into functional analysis to clarify norms and measures is not very helpful to someone who is learning differential equations or probability theory for the first time.